Abstract

The Nearest SubSpace (NSS) classifier estimates the underlying subspace
of each class and assigns each data point to the class whose subspace is
nearest. This paper mainly studies how well NSS generalizes to new
samples. It is proved that NSS is strongly consistent under certain
assumptions. For completeness, NSS is evaluated through experiments on
various simulated and real data sets, in comparison with some other
classifiers based on linear models. It is also shown that NSS obtains
effective classification results and is very efficient, especially for
large scale data sets.

Keywords: Nearest Subspace, Classification, Consistency, Unsupervised Learning


1 Introduction 

The problem of classification is to construct a mapping that can correctly
predict the classes of new objects, given training examples of old objects with
ground truth labels [3B]. It is a classical problem in statistical learning and
machine learning and has been widely used in computer vision, pattern
recognition, bioinformatics, etc. Examples of applications include face recognition,
handwriting recognition and micro-array classification.

More precisely, this problem can be formalized as follows. Given a training
data set $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$, the goal is to find a function
$f : \mathcal{X} \to \mathcal{Y}$ such that $f(x)$ is a good approximation of $y$ for the given $x_i$'s as
well as for new instances $x$. Typically, $\mathcal{X}$ is a continuous domain and $\mathcal{Y}$ is a
finite discrete set.

In the past few decades, a tremendous amount of work has been produced for
this problem. Many approaches have been proposed, e.g., K-Nearest Neighbors
(KNN) PU [T31 [IH] ) Fisher's Linear Discriminant Analysis (LDA) [201111],
Artificial Neural Networks (ANN) [43l|56l|35], Support Vector Machines (SVM) |7l
dmii], and Decision Trees (see [8l|40l|4T] for some well known algorithms). We
refer to [1315] for a more careful overview of classification techniques.
Among this work is a class of methods based on subspace models. The
compelling interest in subspace models can be attributed to their validation on
real data. For instance, it has been shown that the set of all images of a
Lambertian object (e.g., face images) under a variety of lighting conditions can
be accurately approximated by a low-dimensional linear subspace (of dimension
at most 9) [TH [53113]. Another example is that, under the affine camera model,
the coordinate vectors of feature points from a moving rigid object lie in an
affine subspace of dimension at most 3 (see [H]). These applications motivate
modeling data by subspaces, and the study of subspace-based classifiers is an
important branch of this line of work.

The first work in this category was CLAss Featuring Information Compression
(CLAFIC) [55] (also known as the Nearest SubSpace (NSS) classifier [39];
since this name is more descriptive, we use NSS
throughout the paper). In this algorithm, each class is represented by a linear
subspace and data instances are assigned to the nearest subspace. Whereas NSS
aims at a good representation of each class by its subspace, the Learning Subspace
Method (LSM) [5S] proposes to learn the subspaces based on good discrimination
(see m for more variants and discussions). The simple idea of subspace
classifiers has been extended to nonlinear versions in various ways; many have
shown state-of-the-art performance (see [3H][in]l33] for example and Section 2.3
for more details). After the first subspace analysis of face images [55] [5T],
classification approaches with subspace models have been used successfully in face
recognition [^, handwritten digit recognition |29j . speech recognition m as
well as biological pattern recognition problems [38] .

Although the design of subspace-based classification techniques has been
actively explored, their theoretical justification remains under-studied. In this
paper, we restrict our attention to analyzing how well the classifiers
generalize to new samples. Such an analysis quantifies how reliable a
classification approach is and can thus guide
algorithm design. For this purpose, a functional (known as the risk
functional) is used to measure the prediction quality of a classifier. More
precisely, we assume that $X$ and $Y$ are random variables and that the pairs $(x_i, y_i)$ are
drawn independently from the distribution of $(X, Y)$. For a
classifier $f(x)$, its risk functional is defined as:

$$R(f) = E_{(X,Y)} \mathbf{1}(f(X) \neq Y).$$

Based on this, the optimal Bayes rule is defined to be the classifier whose risk
functional is minimal. The Bayes rule is optimal in the sense that its expected
loss (the loss being 1 when the predicted class is not equal to the truth) is minimal.
Note that, since the actual distribution of $(X, Y)$ is unknown, the Bayes rule
is not available in practice. A natural desirable property of practical
classifiers is to have as small a risk functional as possible. In this spirit, the property of
consistency is defined as the convergence of the risk functional to that of the
optimal Bayes rule. In other words, a classifier that is not consistent produces
larger misclassification errors on average than the best scenario, no matter how
many data samples are available. Many classification algorithms, such as KNN,
SVM, LDA and some boosting methods [n Ezi m [531 [s [3], have been shown
to be consistent under certain conditions.

In this paper, we study the consistency property of the Nearest SubSpace
(NSS) classifier. We prove its strong consistency under certain conditions. We
also validate the performance of NSS through experiments, in
comparison with other linear classifiers: LDA, FDA and SVM. These experiments
demonstrate that NSS is effective and performs comparably to its
better known and more popular competitors. Since the classifiers under
consideration are all simple and fundamental ones, they are not state-of-the-art.
However, they are important components of classification, and such an
experimental comparison completes the understanding of NSS. To the best of our
knowledge, an experimental comparison like this (between NSS and other
typical linear classifiers) has not been reported yet. In the rest of the paper,
we begin with a description of the NSS algorithm (Section 2), followed by
its consistency analysis (Section 3) and experiments (Section 4).

2 The NSS Algorithm and its Strong Consistency

For most applications, it suffices to assume that $\mathcal{X} \subset B(0, M) \subset \mathbb{R}^D$ and
$\mathcal{Y} = \{1, \dots, K\}$, where $B(0, M)$ is the ball centered at the origin with radius
$M$, and $D$ and $K$ are positive integers. We will restrict ourselves to this
case throughout the paper.

2.1 The NSS Algorithm 

The NSS classifier assumes that the data lie on multiple affine subspaces, finds an
estimate of these subspaces and assigns each instance to the nearest subspace.
The following is a summary of the NSS algorithm.

Note that the closed-form solution to (1) is given by the Singular Value Decomposition
(SVD) of the centered data matrix for the $k$-th class: the minimizer consists of the
left singular vectors associated with the $d$ largest singular values. This data matrix consists
of $(x_{k_1} - \hat{u}_k, \dots, x_{k_{n_k}} - \hat{u}_k)$ with $x_{k_1}, \dots, x_{k_{n_k}} \in C_k$.
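The fitting and assignment steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: the function names `nss_fit` and `nss_predict` are hypothetical, samples are stored as rows, and each class is assumed to contain at least `d` samples.

```python
import numpy as np

def nss_fit(X, y, d):
    """Estimate one d-dimensional affine subspace per class.

    X: (n, D) array of samples, y: (n,) array of labels, d: subspace dimension.
    Returns a dict mapping each label to (center, basis), basis of shape (D, d).
    """
    model = {}
    for k in np.unique(y):
        Ck = X[y == k]              # samples of class k, one per row
        u = Ck.mean(axis=0)         # estimated class center
        # The right singular vectors of the centered (n_k, D) matrix are the
        # left singular vectors of its (D, n_k) transpose used in the text.
        _, _, Vt = np.linalg.svd(Ck - u, full_matrices=False)
        model[k] = (u, Vt[:d].T)    # top-d orthonormal directions
    return model

def nss_predict(model, X):
    """Assign each row of X to the class whose affine subspace is nearest."""
    labels = list(model)
    dists = np.empty((X.shape[0], len(labels)))
    for j, k in enumerate(labels):
        u, B = model[k]
        Z = X - u
        # ||(I - B B^T) z||^2 = ||z||^2 - ||B^T z||^2 for orthonormal B
        dists[:, j] = (Z ** 2).sum(axis=1) - ((Z @ B) ** 2).sum(axis=1)
    return np.asarray(labels)[dists.argmin(axis=1)]
```

The residual is computed through the projection coefficients rather than by forming the D x D projector, which keeps prediction cost linear in D per class.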

2.2 The Main Theorem 

As mentioned in Section 1, a desirable property for classifiers is consistency.
Denote by $h_n$ any classification rule determined from $n$ samples $\{(x_i, y_i)\}_{i=1}^n$,
by $f^*$ the optimal Bayes rule, i.e., $f^* = \arg\min_f R(f)$, and by $R^* := R(f^*)$ its
risk. Now we define strong consistency in the following sense.

Definition 1 (Strong Consistency). A classification rule $h_n$ is said to be strongly
consistent if

$$R(h_n) \to R^* \quad \text{a.s.}$$
Algorithm 1 Nearest Subspace (NSS) Classification

Require: $\{(x_i, y_i)\}_{i=1}^n \subset \mathcal{X} \times \mathcal{Y}$ and $d$: intrinsic dimension, a positive
integer with $d < D$.

Ensure: A function $f : \mathcal{X} \to \mathcal{Y}$.

for $k = 1$ to $K$ do

    $C_k = \{x_i : y_i = k\}$; $n_k = |C_k|$; $\hat{u}_k = \frac{1}{n_k} \sum_{x_j \in C_k} x_j$.

    $\hat{B}_k = \arg\min_{B \in \mathbb{R}^{D \times d},\ B^T B = I_d} \sum_{x_j \in C_k} \|(I - BB^T)(x_j - \hat{u}_k)\|^2$.    (1)

end for

$f(x) = \arg\min_{1 \le k \le K} \|(I - \hat{B}_k \hat{B}_k^T)(x - \hat{u}_k)\|^2$.


Since the NSS classifier is also based on the $n$ samples $\{(x_i, y_i)\}_{i=1}^n$,
we denote it by $f_n$ for the rest of the paper. We then obtain the following
theorem for the NSS classifier described in Algorithm 1.

Theorem 1. The NSS classifier $f_n$ is strongly consistent, i.e., $R(f_n) \to R^*$ a.s.,
when the following assumptions hold.

(1) $(x_1, y_1), \dots, (x_n, y_n)$ are i.i.d. samples of the random variable $(X, Y)$; $X \in \mathbb{R}^D$
and $Y \in \{1, \dots, K\}$.

(2) $P(Y = i) = \frac{1}{K}$.

(3) $X | Y = k \sim \mu_{L_k} \times \mu_{L_k^\perp}$, where $L_k$ is the underlying $d$-dimensional subspace
for the $k$-th class; $\mu_{L_k}$ is a uniform measure on $L_k \cap B(u_k, M)$ (a bounded ball
centered at $u_k$, the underlying center for the $k$-th class); $\mu_{L_k^\perp}$ is a measure on
$L_k^\perp$ decreasing exponentially w.r.t. the squared distance from $L_k$.

This theorem reveals that the average prediction error of NSS converges to 
the optimal prediction error under certain conditions. It is a similar but slightly 
weaker result in contrast to that for LDA in [53], since the above condition (3) 
is stronger than that for LDA. Note that both results are about consistency for 
a class of distributions. On the other hand, the consistency results for KNN, 
SVM and some boosting methods are for all distributions, and thus are more 
general [4211461 ElE]. 

2.3 Discussions 

The NSS algorithm is a very simple and basic classification method, since it
assumes linear structure in data. Linear models have their limitations, since
the linearity constraint often is not satisfied in real data. However, they are
important for the following reasons: (1) Linear classifiers are easy to compute
and analyze. (2) They are a first order approximation for the true classifier. (3)
They often have good interpretations, which is critical in many applications. (4) Linear
models are the best that can be done when the available training data are
limited. (5) Linear models are the foundation from which more complex models
can be generalized (see [23] for more discussion). Therefore, it is important to
study this class of methods thoroughly, even if in practice they are no longer
state-of-the-art. The computational complexity and the extensions of the NSS
algorithm are discussed below.

Complexity. It is worth pointing out that the NSS algorithm is efficient.
Assuming $D < n$, the computational complexity of the training process of NSS
is $O(KD^2(n_k + 2D))$, where $n_k$ is the number of training samples in class $k$.
LDA and FDA have similar complexity
since they all require some eigen- or singular value decomposition operations.
On the other hand, SVM requires $O(n^2)$ to $O(n^3)$ operations. Therefore, for
large scale ($n$ large and $n \gg D$) problems, training NSS is much
faster than training a linear SVM. When the data are of large scale and
sensible results are needed quickly, NSS is a good choice. Section 4 will provide
more details on the performance of the algorithms in terms of both accuracy
and speed.

Extensions. The NSS method has been modified and extended in three main
ways: localization, the kernel trick and hybrid models. The local
subspace methods find, for each query sample, its nearest neighbors
in each class and classify it by its distances to the subspaces spanned by these
neighbors [45l [29l EH [TOl El] . Because only inner products are
needed in the NSS algorithm, it can be naturally extended by the kernel trick,
where the original data are embedded into a higher dimensional space and
subspace structures are learned there [521 EOl El EH EH ; these two techniques are
combined in [58] . Another direction is to represent each class by multiple
subspaces [221 SHI ED , where [25] also uses a more general metric than the Euclidean
distance. All of these extended techniques define nonlinear decision boundaries
and the recent works SHliniEl] have shown their state-of-the-art performance.
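To make the remark about inner products concrete, the following derivation sketches why the NSS residual depends on the data only through inner products. It assumes the notation of Algorithm 1 and that the top $d$ singular values of the centered class matrix are strictly positive; the symbols $A_k$, $G_k$, $V_d$, $S_d$ and $z$ are introduced here for illustration only.

```latex
% A_k = (x_{k_1} - \hat{u}_k, \dots, x_{k_{n_k}} - \hat{u}_k) \in \mathbb{R}^{D \times n_k},
% with thin SVD A_k = U S V^T, so that \hat{B}_k = U_d = A_k V_d S_d^{-1}.
\begin{aligned}
\|(I - \hat{B}_k \hat{B}_k^T) z\|^2
  &= \langle z, z \rangle - \|\hat{B}_k^T z\|^2 \\
  &= \langle z, z \rangle - \|S_d^{-1} V_d^T A_k^T z\|^2,
  \qquad z = x - \hat{u}_k .
\end{aligned}
% V_d and S_d^2 are the top-d eigenvectors and eigenvalues of the Gram matrix
% G_k = A_k^T A_k, whose entries are
% \langle x_{k_i} - \hat{u}_k, x_{k_j} - \hat{u}_k \rangle,
% and the entries of A_k^T z are \langle x_{k_i} - \hat{u}_k, z \rangle.
```

Every quantity on the right-hand side is an inner product between (centered) samples, so replacing $\langle \cdot, \cdot \rangle$ by a kernel function gives a kernelized variant without ever forming $\hat{B}_k$ explicitly.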

3 Proof of Theorem 1

In this section, we give a complete proof of Theorem 1 following [SI] .

3.1 Notations 

We first describe the problem in detail and prepare to prove the theorem.
Consider a classification problem, where the goal is to assign an individual instance
to one of $K$ classes, given $n$ observations of $(X, Y)$. To do this, the space is
partitioned into subsets $H_1, \dots, H_K$ such that, for $k = 1, \dots, K$, the individual
instance is classified to be in group $k$ when $X \in H_k$. This procedure generates
a discriminant rule as a mapping $f : \mathbb{R}^D \to \{1, \dots, K\}$ that takes the value
$f(X) = k$ whenever the individual is assigned to the $k$-th group, and this can be
written as $f(X) = \sum_{k=1}^K k \mathbf{1}_{H_k}(X)$, where $\mathbf{1}_{H_k}(X)$ is the indicator function of
the subset $H_k$.

Let $Y$ be the discrete random variable (class index or group label) which
represents the true membership of the individual under study. Denote the class
prior probabilities by $\pi_k = P[Y = k] > 0$, $k = 1, \dots, K$, with $\sum_{k=1}^K \pi_k = 1$.
Furthermore, assume there exist density functions $g_k(X)$ such that
$P[X \in A | Y = k] = \int_A g_k(X) \, dX$, $k = 1, \dots, K$, for $A$ a subset of $\mathbb{R}^D$.

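Given $(X, Y)$, the rule $f(X) = \sum_{k=1}^K k \mathbf{1}_{H_k}(X)$ is in error when $f(X) \neq Y$
and its probability of misclassification is computed as:

$$
\begin{aligned}
R(f) &= E_{(X,Y)} \mathbf{1}(f(X) \neq Y) = P[f(X) \neq Y] = 1 - P[f(X) = Y] \\
     &= 1 - \sum_{k=1}^K P[X \in H_k, Y = k] \\
     &= 1 - \sum_{k=1}^K P[Y = k] \, P[X \in H_k | Y = k] \\
     &= 1 - \sum_{k=1}^K \pi_k \int_{H_k} g_k(X) \, dX. \qquad (2)
\end{aligned}
$$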


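The rule $f^* = \sum_{k=1}^K k \mathbf{1}_{H_k^*}(X)$ that minimizes (2), or the Bayes rule, is given
by the partition

$$H_k^* = [X : \pi_k g_k(X) = \max_{1 \le j \le K} \pi_j g_j(X)], \quad k = 1, \dots, K.$$

Then the corresponding optimal error is:

$$R^* = R[f^*(X)] = 1 - \sum_{k=1}^K \pi_k \int_{H_k^*} g_k(X) \, dX.$$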


In general, both $\pi_k$ and $g_k$ are unknown, so rules used in practice are sample
based rules of the form $f_n(X) = \sum_{k=1}^K k \mathbf{1}_{H_{k,n}}(X)$, where the subsets $H_{k,n}$
depend on the data set $\{(x_i, y_i)\}_{i=1}^n$ formed by $n$ i.i.d. observations from
$(X, Y)$. The appropriate measure of error of a sample rule $f_n(X)$ is
$R_n = P[f_n(X) \neq Y]$.


3.2 Proof of Theorem 1

We will first prove a useful lemma which gives a bound for $R_n - R^*$.

Lemma 1. Assume $\pi_k = \frac{1}{K}$ and let $g_{k,n}(X)$ be an estimate of $g_k(X)$ from
the $n$ samples, for $k = 1, \dots, K$. Let $f_n(X)$ be the classifier derived from $g_{k,n}(X)$, i.e.,
$f_n(X) = \arg\max_k g_{k,n}(X)$. Then

$$0 \le R_n - R^* \le \frac{1}{K} \sum_{k=1}^K \int |g_k(X) - g_{k,n}(X)| \, dX.$$
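Proof. Since $\pi_k = \frac{1}{K}$, we have $H_k^* = [X : g_k(X) = \max_{1 \le j \le K} g_j(X)]$. Thus,

$$R^* = 1 - \frac{1}{K} \sum_{k=1}^K \int_{H_k^*} g_k(X) \, dX = 1 - \frac{1}{K} \int \max_k g_k(X) \, dX.$$

On the other hand, writing $g_{f_n}(X)$ and $g_{f_n,n}(X)$ for $g_k(X)$ and $g_{k,n}(X)$ evaluated at $k = f_n(X)$,

$$
\begin{aligned}
R_n - R^* &= \frac{1}{K} \int \big(\max_k g_k(X) - g_{f_n,n}(X)\big) \, dX + \frac{1}{K} \int \big(g_{f_n,n}(X) - g_{f_n}(X)\big) \, dX \\
          &= \frac{1}{K} \int \big(\max_k g_k(X) - \max_k g_{k,n}(X)\big) \, dX + \frac{1}{K} \int \big(g_{f_n,n}(X) - g_{f_n}(X)\big) \, dX \\
          &\le \frac{1}{K} \sum_{k=1}^K \int |g_k(X) - g_{k,n}(X)| \, dX.
\end{aligned}
$$

□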
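The last inequality in the proof of Lemma 1 can be checked pointwise. The following is a short sketch of that step, with $j$ and $m$ introduced here as shorthand for the Bayes decision and the plug-in decision at a fixed point $X$.

```latex
% Fix X and let j = \arg\max_k g_k(X), \ m = f_n(X) = \arg\max_k g_{k,n}(X).
\begin{aligned}
\max_k g_k(X) - g_{f_n}(X)
  &= g_j(X) - g_m(X) \\
  &= \big(g_j(X) - g_{j,n}(X)\big)
   + \big(g_{j,n}(X) - g_{m,n}(X)\big)
   + \big(g_{m,n}(X) - g_m(X)\big) \\
  &\le |g_j(X) - g_{j,n}(X)| + |g_{m,n}(X) - g_m(X)| \\
  &\le \sum_{k=1}^K |g_k(X) - g_{k,n}(X)| .
\end{aligned}
% The middle term is non-positive because m maximizes g_{k,n}(X).
% If j = m the left-hand side is zero; otherwise the two absolute values
% are distinct summands of the final sum.
```

Integrating over $X$ and multiplying by $\frac{1}{K}$ gives the upper bound of the lemma; the lower bound $0 \le R_n - R^*$ holds because the integrand is non-negative.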


A result similar to Lemma 1 can be found in Theorem 1 of [16] (p. 254).
Now we prove the main theorem of our paper.

Proof of Theorem 1. Due to condition (2), we have

$$H_k^* = [X : g_k(X) = \max_{1 \le j \le K} g_j(X)].$$

On the other hand, based on assumption (3), the density functions can
be written as

$$g_k(X) = C(d) \beta \exp(-\alpha t), \qquad t = (X - u_k)^T (I - P_k)(X - u_k),$$

for some $\alpha > 0$, $\beta$ independent of $t$, a constant $C(d)$ and $P_k = B_k B_k^T$, with $B_k$
being an orthonormal basis for $L_k$.

Then the classifier generated by Algorithm 1 can be written as:

$$f_n(X) = \sum_{k=1}^K k \mathbf{1}_{H_{k,n}}(X)$$

with the following notation:

$$
\begin{aligned}
\hat{P}_k &= \hat{B}_k \hat{B}_k^T, \\
g_{k,n}(X) &= C(d) \beta \exp\big(-\alpha (X - \hat{u}_k)^T (I - \hat{P}_k)(X - \hat{u}_k)\big), \\
H_{k,n} &= [X : g_{k,n}(X) = \max_{1 \le j \le K} g_{j,n}(X)].
\end{aligned}
$$
Thus the NSS classifier can be considered as a plug-in version of the Bayes
rule. By Lemma 1, the difference $R_n - R^*$ can be bounded in the form

$$R_n - R^* \le \frac{1}{K} \sum_{k=1}^K \int |g_k(X) - g_{k,n}(X)| \, dX.$$

For each fixed $1 \le k \le K$, we have

$$0 \le \int_{\mathbb{R}^D} |g_k(X) - g_{k,n}(X)| \, dX \le \int_{\mathbb{R}^D} g_k(X) + g_{k,n}(X) \, dX < \infty.$$

Therefore, it suffices to show that $g_{k,n} \to g_k$ a.s. and, due to the continuity
of $g(\cdot)$, to show that $\hat{u}_k \to u_k$ and $\hat{P}_k \to P_k$ a.s. The fact that $\hat{u}_k$ and $\hat{P}_k$ are the
maximum-likelihood estimates (MLE) of $u_k$ and $P_k$ completes the proof. □

4 Experiments 

In this section, we evaluate the performance of the NSS algorithm through
various experiments and compare it with LDA, FDA and linear SVM. The
purpose of these results is two-fold. First, they show that, as
a simple and basic method, the NSS algorithm can obtain very useful results
and is comparable to its competitors. Second, they serve as a complementary
perspective to the theoretical portion of the paper. We include
LDA, FDA and linear SVM in the comparison because they are similar to
NSS. Note that the objective is not to prove that NSS is state-of-the-art; rather,
the significance of studying this method has been discussed in
Section 2.3.

4.1 Data 

We test the classification methods on two simulated data sets and five real data
sets. In the following, we give a brief description of each of them; a
summary of size, dimension and the number of classes can be found in Table 1.

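Mixture Gaussian. Data samples are generated from $K = 3$ Gaussian distributions
in $\mathbb{R}^3$ with means $\mu_1 = (1, 2, 3)^T$, $\mu_2 = (-1, -2, -3)^T$, $\mu_3 = (-1, 2, -3)^T$
and variances

$$
\Sigma_1 = \begin{pmatrix} 3 & 0.2 & 0.1 \\ 0.2 & 2 & 0.2 \\ 0.1 & 0.2 & 2 \end{pmatrix}, \quad
\Sigma_2 = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
\Sigma_3 = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3 \end{pmatrix}.
$$

The total number of samples is 1200; 400 in each class.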
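A data set of this kind can be generated with a short NumPy sketch. The function name `sample_gaussian_mixture`, the seed handling and the label encoding 1, 2, 3 are assumptions made for illustration; the means and covariance matrices are the ones listed above.

```python
import numpy as np

def sample_gaussian_mixture(n_per_class=400, seed=0):
    """Draw n_per_class points from each of the three Gaussians in R^3."""
    rng = np.random.default_rng(seed)
    means = [np.array([1.0, 2.0, 3.0]),
             np.array([-1.0, -2.0, -3.0]),
             np.array([-1.0, 2.0, -3.0])]
    covs = [np.array([[3.0, 0.2, 0.1],
                      [0.2, 2.0, 0.2],
                      [0.1, 0.2, 2.0]]),
            np.diag([2.0, 1.0, 1.0]),
            np.diag([2.0, 2.0, 3.0])]
    # Stack the three class samples; labels follow the class order 1, 2, 3.
    X = np.vstack([rng.multivariate_normal(m, c, size=n_per_class)
                   for m, c in zip(means, covs)])
    y = np.repeat(np.arange(1, 4), n_per_class)
    return X, y
```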

Multiple Subspaces. For the multiple subspaces experiment, data are generated
uniformly from three 2-dimensional linear subspaces (bounded in a unit disk)
in $\mathbb{R}^{50}$. The pairwise angles between the subspaces are bounded from below. Gaussian noise with 0
mean and 0.05 standard deviation is added. Again, 1200 samples are generated
in total with 400 in each class.
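The construction can be sketched as follows. Several choices here are assumptions rather than the authors' procedure: the subspaces are drawn at random through a QR factorization, no minimum angle between them is enforced, and the function name `sample_subspaces` is hypothetical. The ambient dimension 50 is taken from Table 1.

```python
import numpy as np

def sample_subspaces(n_per_class=400, D=50, noise_std=0.05, seed=0):
    """Noisy samples from three random 2-dimensional subspaces of R^D."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for k in range(3):
        # Random orthonormal basis of a 2-dimensional subspace (assumption).
        B, _ = np.linalg.qr(rng.standard_normal((D, 2)))
        # Uniform points in the unit disk: radius sqrt(U), angle uniform.
        r = np.sqrt(rng.uniform(size=n_per_class))
        theta = rng.uniform(0.0, 2.0 * np.pi, size=n_per_class)
        coords = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
        # Embed in R^D and add isotropic Gaussian noise.
        pts = coords @ B.T + noise_std * rng.standard_normal((n_per_class, D))
        X.append(pts)
        y.append(np.full(n_per_class, k + 1))
    return np.vstack(X), np.concatenate(y)
```

Taking the square root of the uniform radius variable is what makes the points uniform over the disk area rather than clustered near the center.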


Wine. Wine recognition data are the results of a chemical analysis of wines 
grown in the same region in Italy but derived from three different cultivars. The 
goal is to determine the types of wines from the quantities of 13 constituents 
found in them. The data were first collected in [22] and now can be found in 
the UCI machine learning repository. 

DNA. We use the Statlog version of the primate splice-junction DNA data set
(found in [4^). The problem is to recognize, given a DNA sequence, the
boundaries between exons (the parts of the DNA sequence retained after splicing) and
introns (the parts of the DNA sequence that are spliced out). The features are
binary variables representing nucleotides in the DNA sequence. The three classes
are "intron to exon" boundary, "exon to intron" boundary and neither. This
data set has three subsets: training, evaluation and testing. All of them are
used in our experiments.

USPS. USPS [15] is a database of scanned images of handwritten digits from
US Postal Service envelopes. The goal is to recognize digits given their 16 × 16
grayscale images. Both the training and testing sets are used in our experiments.

Vehicle. This Vehicle data set m collects signals obtained by both acoustic 
and seismic sensors and the goal is to classify vehicle types from the original 
data. It has two subsets, training and testing, and we use both of them in our 
experiments. 

News20. The 20 Newsgroups data set is a collection of approximately 20,000
newsgroup documents, partitioned (nearly) evenly across 20 different
newsgroups. The problem posed here is to recognize the newsgroups from texts.
This data was originally collected in [30]. Due to the very large scale of the
data, we use only the testing set in our experiment for simplicity.


Table 1: A summary of the data sets

Data                 Data size         # of classes   Ambient dimension   Reduced
                     (# of samples)                   (# of features)     dimension
Mixture Gaussian     1200              3              3                   NA
Multiple Subspaces   1200              3              50                  NA
Wine                 178               3              13                  NA
Vehicle              98,528            3              100                 NA
DNA                  3,186             3              180                 NA
USPS                 9,298             10             256                 38
News20               3,993             20             62,060              1000

4.2 Implementation Details

Real data used in our experiments are originally from the UCI machine learning
repository, Statlog and other collections. We download them from [53], where
data samples have been scaled linearly to be within [0, 1] or [-1, 1]. For the
data sets USPS and News20, the ambient dimension is reduced by Principal
Component Analysis to at most 1000, chosen such that 95% of the variance is explained.
The reduced dimension for the USPS and News20 data sets is shown in the
last column of Table 1. For NSS, the intrinsic dimension of the subspaces is
determined by 10-fold cross validation.
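The dimension reduction step can be sketched as below. This is one plausible reading of the rule, stated as an assumption: keep the smallest number of principal components whose cumulative explained variance reaches 95%, and cap that number at 1000. The function name `pca_reduce` and its parameters are hypothetical.

```python
import numpy as np

def pca_reduce(X, var_ratio=0.95, max_dim=1000):
    """Project rows of X onto the leading principal components.

    Returns the reduced data and the number of components kept.
    """
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest number of components reaching the target ratio, capped at max_dim.
    m = min(int(np.searchsorted(explained, var_ratio)) + 1, max_dim, len(s))
    return Xc @ Vt[:m].T, m
```

In an evaluation protocol the projection would normally be estimated on the training split only and then applied to the test split; the sketch leaves that choice to the caller.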

The classification experiments are carried out in Matlab. We use the default
function classify of the Statistics toolbox for LDA. For multiclass FDA and
SVM (see [32] and [11]), we use implementations from [25] and [31]. NSS
is simple to implement, and the version we use can be found on the author's
homepage http://www.math.duke.edu/-yiwang/

4.3 Results 

For each data set, we randomly split it into two subsets containing 80% and 20%
of the data, and use the former as the training set and the latter as the testing
set. All experiments are repeated 200 times for the simulated data sets and 10
times for the real data sets, including the random generation (for the simulated
data sets) and the random splitting processes. The mean and standard deviation
of the accuracy of all methods under investigation are reported in Table 2, while
the running time for the training process is recorded in Table 3.
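The evaluation protocol described above can be sketched as a generic routine that accepts any pair of fit and predict functions. The names `repeated_holdout`, `fit_nearest_mean` and `predict_nearest_mean` are hypothetical, and the nearest-mean classifier is only a stand-in used to exercise the routine, not one of the methods compared in the paper.

```python
import numpy as np

def repeated_holdout(X, y, fit, predict, n_repeats=10, train_frac=0.8, seed=0):
    """Mean and standard deviation of test accuracy over random splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(train_frac * n))
    accs = []
    for _ in range(n_repeats):
        perm = rng.permutation(n)           # fresh random split each repeat
        tr, te = perm[:n_train], perm[n_train:]
        model = fit(X[tr], y[tr])
        accs.append(np.mean(predict(model, X[te]) == y[te]))
    return float(np.mean(accs)), float(np.std(accs))

def fit_nearest_mean(X, y):
    """Stand-in classifier: store one mean vector per class."""
    labels = np.unique(y)
    return labels, np.stack([X[y == k].mean(axis=0) for k in labels])

def predict_nearest_mean(model, X):
    labels, means = model
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return labels[d2.argmin(axis=1)]
```

Passing the classifier as two callables keeps the splitting logic identical across methods, which is what makes the reported means and standard deviations comparable.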


Table 2: A summary of classification results: mean accuracy ± standard deviation (%)

Data        NSS             LDA             FDA             SVM
Gaussian    88.11 ± 4.47    95.12 ± 1.58    81.43 ± 3.74    94.93 ± 1.58
Subspace    99.16 ± 0.51    34.97 ± 2.73    33.74 ± 2.26    46.57 ± 3.23
Wine        94.29 ± 3.81    98.57 ± 2.02    92.57 ± 3.86    96.29 ± 1.93
Vehicle     74.23 ± 0.34    80.15 ± 0.22    75.11 ± 0.33    NA
DNA         90.28 ± 1.05    93.23 ± 1.02    78.59 ± 2.07    91.11 ± 0.94
USPS        96.54 ± 0.39    91.19 ± 0.58    48.58 ± 1.01    94.02 ± 0.48
News20      75.18 ± 1.60    35.55 ± 1.72    9.50 ± 1.39     75.54 ± 1.26



From the above results, we know that the NSS algorithm can obtain results 
comparable to its better known competitors LDA and SVM, for a broad range 
of classification problems. Meanwhile, the computation is very fast, roughly the 
same order as FDA and LDA, but significantly faster than SVM, especially for 
large scale problems. Additionally, LDA requires that the covariance matrix 
is positive definite, which is not satisfied in some high dimensional data sets. 
This is another reason why we reduce the ambient dimension for the USPS and 
News20 datasets. However, NSS does not have this restriction. 
Table 3: A summary of running time on real data sets (seconds)

Data        NSS             LDA       FDA       SVM
Wine        8.175 X 10-"    0.012     0.019     0.654
Vehicle     1.064           1.206     3.449     NA
DNA         0.042           0.037     0.212     73.685
USPS        0.035           0.052     0.148     935.296
News20      1.089           1.018     12.841    168.209



5 Conclusion 

In this paper, we reviewed a simple classification algorithm (NSS) based on the 
model of multiple subspaces. We proved its strong consistency under certain 
conditions, which means that under these conditions, the prediction error of NSS 
on average converges to that of the optimal classifier. Finally, we evaluated NSS 
on various data sets and compared it with its competitors. Results showed that 
NSS can obtain very useful results efficiently, especially for large scale data sets. 

By studying the consistency property of NSS, we are inspired to further
explore subspace-based classification methods along the following directions in the
future. First, NSS finds a good estimation for the underlying subspace models
by minimizing the sum of squares of fitting errors. However, for the purpose
of classification, it is more helpful to obtain models which can "separate" or
"discriminate" classes. Therefore, in order to improve the classification
performance, some separation measure can be taken into account. In fact, an advanced
supervised learning method based on multiple subspaces has been proposed [JS] .
It would be fruitful to analyze this method or other variants theoretically.

Moreover, a general way to find a good classifier is to minimize an empirical
risk function, which is typically defined as $R_{emp}(f) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(f(x_i) \neq y_i)$. This
idea can be combined with the multiple subspaces model. Approaches similar
to that in [52] can be applied to analyze its consistency.
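For reference, the empirical risk under the 0-1 loss is just the fraction of misclassified samples. A minimal sketch, with the hypothetical function name `empirical_risk`:

```python
import numpy as np

def empirical_risk(y_pred, y_true):
    """Fraction of samples whose predicted label differs from the true label."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return float(np.mean(y_pred != y_true))
```

One minus this quantity, computed on held-out data, is the accuracy reported in the experiments.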